Analysis of Japanese Compound Nouns using Collocational Information
Authors
Abstract
Analyzing compound nouns is one of the crucial issues for natural language processing systems, in particular for those systems that aim at a wide coverage of domains. In this paper, we propose a method to analyze structures of Japanese compound nouns by using both word collocation statistics and a thesaurus. An experiment is conducted with 160,000 word collocations to analyze compound nouns with an average length of 4.9 characters. The accuracy of this method is about 80%.
1 Introduction
Analyzing compound nouns is one of the crucial issues for natural language processing systems, in particular for those systems that aim at a wide coverage of domains. Registering all compound nouns in a dictionary is an impractical approach, since we can create a new compound noun by combining nouns. Therefore, a mechanism to analyze the structure of a compound noun from the individual nouns is necessary.
In order to identify structures of a compound noun, we must first find the set of words that compose it. This task is trivial for languages such as English, where words are separated by spaces. The situation is worse, however, in Japanese, where no spaces are placed between words. The process of identifying word boundaries is usually called segmentation. In processing languages such as Japanese, ambiguities in segmentation should be resolved at the same time as analyzing structure. For instance, the Japanese compound noun "新型間接税" (new indirect tax) has 16 (= 2^4) possible segmentations, since each of the four gaps between its five characters may or may not be a word boundary. By consulting a Japanese dictionary, we can filter out some of them. In this case, two possibilities remain: "新 (new)/型 (type)/間接 (indirect)/税 (tax)" and "新型 (new)/間接 (indirect)/税 (tax)".¹ We must choose the correct segmentation, "新型 (new)/間接 (indirect)/税 (tax)", and analyze its structure.
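The segmentation count in this example can be checked with a short sketch. It enumerates every way of cutting the five-character compound 新型間接税 and then keeps only the segmentations whose words all appear in a dictionary. The function names and the tiny `TOY_DICTIONARY` are hypothetical and chosen only so that the two candidates named in the text survive. A real dictionary would contain many more entries and might leave a different set of candidates.

```python
from itertools import product

def all_segmentations(text):
    """Return every segmentation of a non-empty string.

    Each of the len(text) - 1 gaps between characters is either a word
    boundary or not, which gives 2 ** (len(text) - 1) segmentations.
    """
    results = []
    for cuts in product([False, True], repeat=len(text) - 1):
        words, start = [], 0
        for position, is_cut in enumerate(cuts, start=1):
            if is_cut:
                words.append(text[start:position])
                start = position
        words.append(text[start:])
        results.append(words)
    return results

def dictionary_filter(segmentations, dictionary):
    """Keep only segmentations in which every word is a dictionary entry."""
    return [seg for seg in segmentations if all(w in dictionary for w in seg)]

# Hypothetical toy dictionary (an assumption, not the one used in the paper).
TOY_DICTIONARY = {"新", "型", "新型", "間接", "税"}

compound = "新型間接税"
candidates = all_segmentations(compound)
remaining = dictionary_filter(candidates, TOY_DICTIONARY)

print(len(candidates))
for seg in remaining:
    print("/".join(seg))
```

The dictionary lookup alone cannot choose between the surviving candidates, which is why the paper resolves segmentation together with structure analysis.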
¹ Here "/" denotes a boundary of words.
Segmentation of Japanese is difficult when only syntactic knowledge is used. Therefore, we cannot always expect a sequence of correctly segmented words as input to structure analysis. Information about structures is also expected to improve segmentation accuracy.
Several studies have addressed this problem. Fuzisaki et al. applied an HMM to segmentation and a probabilistic CFG to analyzing the structure of compound nouns [3]. The accuracy of their method is 73% in identifying correct structures of kanzi character sequences with an average length of 4.2 characters. In their approach, word boundaries are identified through purely statistical information (the HMM) without regard to linguistic knowledge such as dictionaries. Therefore, the HMM may suggest an improper character sequence as a word. Furthermore, since the nonterminal symbols of the CFG are derived from a statistical analysis of word collocations, their number tends to be large, and so the number of CFG rules is also large. They assumed that compound nouns consist only of one-character and two-character words. It is questionable whether this method can be extended to handle cases that include words of more than two characters without lowering accuracy.
In this paper, we propose a method to analyze structures of Japanese compound nouns by using word collocational information and a thesaurus. The collocational information is acquired from a corpus of four-kanzi-character words.
The outline of the procedure to acquire the collocational information is as follows:
• extract collocations of nouns from a corpus of four-kanzi-character words
• replace each noun in the collocations with its thesaurus category, to obtain collocations of thesaurus categories
• count occurrence frequencies for each collocational pattern of thesaurus categories
For each possible structure of a compound noun, a preference is calculated based on this collocational information, and the structure with the highest score wins.
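The acquisition steps and the "highest score wins" selection can be illustrated with a minimal sketch. Several parts are assumptions not stated in this excerpt: each four-kanzi word is split into two two-character nouns, the head of a sub-compound is taken to be its rightmost noun, and a structure's preference is the product of the category collocation counts of its modifier-head pairs. The thesaurus, corpus, and function names are hypothetical toy values, and the actual preference formula of the paper may differ.

```python
from collections import Counter

# Hypothetical toy thesaurus mapping nouns to category labels.
TOY_THESAURUS = {
    "自然": "NATURE",
    "言語": "DATA",
    "情報": "DATA",
    "画像": "DATA",
    "処理": "ACTION",
    "検索": "ACTION",
}

def count_category_collocations(four_kanzi_words, thesaurus):
    """Count category pairs from four-character words.

    Assumption: each word splits into two two-character nouns.
    Words with a noun missing from the thesaurus are skipped.
    """
    counts = Counter()
    for word in four_kanzi_words:
        if len(word) != 4:
            continue
        left, right = word[:2], word[2:]
        if left in thesaurus and right in thesaurus:
            counts[(thesaurus[left], thesaurus[right])] += 1
    return counts

def score_structures(nouns, counts, thesaurus):
    """Score both bracketings of a three-noun compound.

    Assumed preference: product of counts over modifier-head pairs,
    with the rightmost noun of a sub-compound acting as its head.
    """
    a, b, c = (thesaurus[n] for n in nouns)
    return {
        "left": counts[(a, b)] * counts[(b, c)],   # ((n1 n2) n3)
        "right": counts[(b, c)] * counts[(a, c)],  # (n1 (n2 n3))
    }

# Hypothetical toy corpus of four-kanzi words.
corpus = ["自然言語", "言語処理", "情報処理", "画像処理", "情報検索"]
counts = count_category_collocations(corpus, TOY_THESAURUS)
scores = score_structures(["自然", "言語", "処理"], counts, TOY_THESAURUS)
best = max(scores, key=scores.get)
print(scores, best)
```

Counting over thesaurus categories instead of individual nouns lets a pair such as 画像/処理 support the analysis of 言語/処理 even when that exact noun pair is rare in the corpus.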
Similar resources
Collocations of Complex Nouns: Evidence for Lexicalisation
This paper combines a corpus-based study of noun+verb collocations with an attempt to distinguish compositional, regularly formed compounds from lexicalised ones. We claim that morphologically regular, compositional compounds share most of their collocational preferences with their compound heads, whereas lexicalised compounds have their own collocational preferences, distinct or only marginall...
Analysis of Japanese Compound Nouns by Direct Text Scanning
This paper aims to analyze word dependency structure in compound nouns appearing in Japanese newspaper articles. The analysis is a difficult problem because such compound nouns can be quite long, have no word boundaries between contained nouns, and often contain unregistered words such as abbreviations. The nonsegmentation property and unregistered words cause initial segmentation errors whic...
An Analysis of Persian Compound Nouns as Constructions
In Construction Morphology (CM), a compound is treated as a construction at the word level with a systematic correlation between its form and meaning, in the sense that any change in the form is accompanied by a change in the meaning. Compound words are coined by compounding templates which are called abstract schemas in CM. These abstract constructional schemas generalize over sets of existing...
Measuring the Similarity between Compound Nouns in Different Languages Using Non-Parallel Corpora
This paper presents a method that measures the similarity between compound nouns in different languages to locate translation equivalents from corpora. The method uses information from unrelated corpora in different languages that do not have to be parallel. This means that many corpora can be used. The method compares the contexts of target compound nouns and translation candidates in the word...